All the necessary data sets can be downloaded from the following Google Drive folder: https://drive.google.com/drive/folders/187ekqZ8FSeQF9dCKV66eMXEBnLIlaqqF?usp=drive_link
However, it is recommended that you use the raw URL addresses below, since one of the data sets is quite large.
mystery_data.csv:
dopamine_reaction_time.csv:
insta.csv:
While navigating the shadowy lower levels of the CAB catacombs in
search of caffeine, an intrepid student archaeologist stumbled upon a
dusty, long-forgotten treasure chest. Inside: a tattered file of data
called mystery_data.csv, sealed in wax and covered in
cryptic statistical runes.
Your objectives:
Use ggplot2 to create a bar plot that clearly conveys differences between conditions in the data, if any.
Add error bars representing one standard error above and below the classic mean (i.e., not a trimmed or robust version).
Inside a data frame, display (at a minimum) the following statistics for each condition:
library(tidyverse)

# Load data
data <- read_csv("data/mystery_data.csv")

# Get stats
plot_data <- data |>
  group_by(condition) |>
  summarise(
    n = length(value),
    m = mean(value),
    se = sd(value) / sqrt(n)
  )
plot_data
## # A tibble: 2 × 4
## condition n m se
## <chr> <int> <dbl> <dbl>
## 1 I 1160153 -2.220107 1.857286
## 2 II 1160554 -9.407373 1.856641
# Draw plot
ggplot(plot_data, aes(x = condition, y = m)) +
  geom_bar(
    stat = "identity",
    colour = "black",
    fill = "antiquewhite3"
  ) +
  geom_errorbar(
    aes(ymin = m - se, ymax = m + se),
    width = 0.25
  ) +
  xlab("Condition") +
  ylab("Value")
Repeat question 1, but use 10% trimming and have the error bars represent two-sided 99% confidence intervals. Inside your data frame, make sure to also include the confidence interval’s top and bottom boundaries for each condition.
library(WRS2)

# Get stats (10% trimming, 99% CI)
plot_data <- data |>
  group_by(condition) |>
  summarise(
    n = length(value),
    G = 0.1,
    m = mean(value, tr = G),
    s = sqrt(winvar(value, tr = G)),      # winsorized SD
    se = s / ((1 - 2 * G) * sqrt(n)),     # SE of the trimmed mean
    h = n - 2 * floor(G * n),             # effective sample size after trimming
    df = h - 1,
    alpha = 0.01,
    t_crit = abs(qt(alpha / 2, df = df)),
    low_ci = m - t_crit * se,
    top_ci = m + t_crit * se
  )
plot_data
## # A tibble: 2 × 12
## condition n G m s se h df alpha
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 I 1160153 0.1 -2.436849 1648.621 1.913256 928123 928122 0.01
## 2 II 1160554 0.1 -9.216171 1648.767 1.913095 928444 928443 0.01
## t_crit low_ci top_ci
## <dbl> <dbl> <dbl>
## 1 2.575835 -7.365080 2.491382
## 2 2.575835 -14.14399 -4.288354
# Draw plot
ggplot(plot_data, aes(x = condition, y = m)) +
  geom_bar(
    stat = "identity",
    colour = "black",
    fill = "antiquewhite4"
  ) +
  geom_errorbar(
    aes(ymin = low_ci, ymax = top_ci),
    width = 0.25
  ) +
  xlab("Condition") +
  ylab("Value")
Using the boxplot rule for outlier detection, report the number of
outliers found for condition “I” in mystery_data.csv. You
are not permitted to use the functions quantile() or
IQR().
# Grab condition 1 only
cond_1 <- filter(data, condition == "I")
# Store as vector and sort
cond_1 <- sort(cond_1$value)
N <- length(cond_1) # note that N is an odd number
# Bottom and Top of data
bot_half <- cond_1[1 : ((N+1) / 2)]
top_half <- cond_1[((N+1) / 2):N]
# Quartiles
q1 <- median(bot_half)
q3 <- median(top_half)
iqr <- q3 - q1
# Boundaries
bot <- q1 - 1.5 * iqr
top <- q3 + 1.5 * iqr
# Count
length(cond_1[cond_1 < bot | cond_1 > top])
## [1] 8034
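As a sanity check on simulated data (not the mystery data), the median-of-halves quartiles used above can be compared against boxplot.stats(), which computes Tukey's hinges; for an odd-length sample the two conventions coincide:

```r
# Sanity check: for an odd-length sorted sample, quartiles computed as
# medians of the two halves (median included in both) match Tukey's hinges,
# so the outlier count agrees with boxplot.stats(). Simulated values only.
set.seed(42)
x <- sort(rnorm(101))
N <- length(x)                      # odd, as in the data above
q1 <- median(x[1:((N + 1) / 2)])    # lower half, median included
q3 <- median(x[((N + 1) / 2):N])    # upper half, median included
iqr <- q3 - q1
manual <- sum(x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr)
builtin <- length(boxplot.stats(x)$out)
c(manual = manual, builtin = builtin)
```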
A psychologist studying expert performance under high-pressure conditions is examining how consistently elite athletes deliver when the stakes are highest. One of the most electrifying players in modern hockey, Connor McDavid of the Edmonton Oilers, recorded an astonishing 153 points over 82 games during the 2022–2023 NHL season, averaging approximately 1.87 points per game.
Suppose McDavid’s per-game scoring follows a normal distribution with a mean of 1.87 points and a standard deviation of 0.55 points.
Now imagine this psychologist is watching from the stands at Rogers Place, during a crucial playoff game, wondering if McDavid will rise to the occasion once again.
Using this model, what is the probability that McDavid scores more than 2.295 points in a randomly selected game?
pnorm(2.295, mean = 1.87, sd = 0.55, lower.tail = FALSE)
## [1] 0.2198419
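The same probability can be obtained by standardizing first, since pnorm() standardizes internally; this makes the z-score explicit:

```r
# Same probability via an explicit z-score
z <- (2.295 - 1.87) / 0.55
pnorm(z, lower.tail = FALSE)
```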
From the previous question, what is the probability that McDavid scores between 1 and 2 points in a randomly selected game?
# Probability of 2 points
p2 <- pnorm(2, mean = 1.87, sd = 0.55)
# Probability of 1 point
p1 <- pnorm(1, mean = 1.87, sd = 0.55)
# Between 2 and 1
p2 - p1
## [1] 0.5365792
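As a numerical cross-check, the same probability is the area under the normal density between 1 and 2, which integrate() can compute directly:

```r
# Numerical cross-check: integrate the normal density over [1, 2]
area <- integrate(dnorm, lower = 1, upper = 2, mean = 1.87, sd = 0.55)
area$value
```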
In an informal study of Oilers fans’ playoff rituals, a local sports psychologist surveyed fans outside Rogers Place to see how many lucky items they bring with them to each playoff game. These included things like lucky jerseys, special pucks, autographed McDavid photos, and even one person’s mysterious “Victory Pierogi.”
The distribution of “lucky item counts” brought to the rink is shown in the histogram below.
Calculate the mean number of lucky items brought by fans to the game based on the histogram.
# Histogram bar heights: counts of fans reporting 0-6 lucky items
item_num <- 0:6
weights <- c(3, 13, 17, 9, 3, 8, 2)
sum(item_num * weights) / sum(weights)
## [1] 2.509091
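Equivalently, the weighted mean can be checked by reconstructing one observation per fan from the bar heights and taking a plain mean:

```r
# Equivalent check: rebuild one observation per fan, then average
fans <- rep(0:6, times = c(3, 13, 17, 9, 3, 8, 2))
mean(fans)
```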
A behavioral neuroscientist is interested in how a dopamine agonist affects motor response speed. Participants were randomly assigned to receive either a placebo or a low dose of a dopamine agonist before performing a computerized simple reaction time task. In each trial, a visual cue appeared at a random interval, and participants were instructed to press a key as quickly as possible in response.
The researcher computed each participant’s mean reaction time (in
milliseconds) across trials. They hypothesized that the dopamine
condition would result in faster overall reaction times. The
data can be found in dopamine_reaction_time.csv. Conduct an
appropriate statistical test that evaluates whether or not the
researcher’s hypothesis should be accepted.
Use of t.test() and yuen() is not
permitted.
In your output please report the following:
The null and alternative hypothesis you tested
The test statistic
Degrees of freedom
p-value
95% Confidence interval
Your conclusion
The normality assumption is reasonable; however, the assumption of equal variances is not. A Welch t-test is most appropriate.
Hypotheses:
\(H_0: \mu_{dope} - \mu_{pla} \geq 0\)
\(H_1: \mu_{dope} - \mu_{pla} < 0\)
# Load data
rt <- read_csv("data/dopamine_reaction_time.csv")

# Get useful stats
stats <- rt |>
  group_by(group) |>
  summarise(
    n = length(reaction_time),
    m = mean(reaction_time),
    q = var(reaction_time) / n   # squared standard error per group
  )
# Standard error
se <- sqrt(sum(stats$q))
# Degrees of freedom
df <- (sum(stats$q)^2) / sum(stats$q^2 / (stats$n - 1))
# Test stat
mu <- 0
t_stat <- ((stats$m[1] - stats$m[2]) - mu) / se
# p-value
p <- pt(t_stat, df = df)
# One-sided 95% confidence interval (matches the directional H1)
alpha <- 0.05
t_crit <- qt(alpha, df = df, lower.tail = FALSE)
m_diff <- stats$m[1] - stats$m[2]
low_ci <- -Inf   # lower bound is unbounded for a one-sided interval
top_ci <- m_diff + t_crit * se
ci <- paste0("(", low_ci, ", ", round(top_ci, 4), "]")
## T = -7.9149
## df = 209.0188
## p = 7.148693e-14
## 95% CI = (-Inf, -28.6164]
The alternative hypothesis posited faster reaction times in the dopamine condition. Given that \(p < 0.05\) and the null is therefore rejected, there is statistical support for the researcher’s claim.
Note:
The null and alternative hypotheses need to be logically opposite of one another; the logic of hypothesis testing only works if that is the case. You cannot, for instance, pair a null hypothesis like \(H_0: \mu_1 - \mu_2 = 0\) with an alternative like \(H_1: \mu_1 - \mu_2 < 0\).
Since our goal is to accept the researcher’s hypothesis (\(\mu_{dope} - \mu_{pla} < 0\)), it is framed as the alternative hypothesis because that is the only hypothesis it is possible to accept. Thus, the null hypothesis being tested needs to be the logical opposite of that (i.e., \(\mu_{dope} - \mu_{pla} \geq 0\)).
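On simulated data (the group names and parameters below are made up), the manual Welch computations can be verified against R's built-in t.test(), which the question forbids for the actual analysis but which is fine as a cross-check:

```r
# Sketch: manual Welch t-test vs. t.test() on simulated reaction times
set.seed(1)
a <- rnorm(40, mean = 300, sd = 20)   # hypothetical "dopamine" RTs
b <- rnorm(50, mean = 330, sd = 45)   # hypothetical "placebo" RTs
qa <- var(a) / length(a)              # squared SE per group
qb <- var(b) / length(b)
se <- sqrt(qa + qb)
# Welch-Satterthwaite degrees of freedom
df <- (qa + qb)^2 / (qa^2 / (length(a) - 1) + qb^2 / (length(b) - 1))
t_stat <- (mean(a) - mean(b)) / se
p <- pt(t_stat, df = df)              # one-sided, H1: mu_a - mu_b < 0
ref <- t.test(a, b, alternative = "less")
c(manual = t_stat, builtin = unname(ref$statistic))
```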
A recent exposé revealed that the Instagram app has been secretly logging private usage data from its users and transmitting it to its parent company, Meta. Following the data leak, major news outlets reported that Zoomers (members of Gen Z between the ages of 13 and 28) spend an average of 6 hours per day on their smartphones.
You manage to obtain a random sample of screen time data from
Zoomers. Using this sample, conduct an appropriate test that
evaluates whether the media’s claim of a 6-hour average should be
rejected. The data is in insta.csv.
Clearly state your null and alternative hypotheses.
Report the test statistic, p-value, 95% Confidence Interval.
\(H_0: \mu = 6\)
\(H_1: \mu \neq 6\)
# Load data
screen <- read_csv("data/insta.csv")
# Assess normality
ggplot(screen, aes(sample = time_hr)) +
stat_qq() +
stat_qq_line()
The data are non-normal, so the assumption of a normally distributed population is unreasonable; a trimmed one-sample t-test is required.
library(WRS2)

G <- 0.2   # 20% trimming
N <- nrow(screen)
m <- mean(screen$time_hr, tr = G)
s <- sqrt(winvar(screen$time_hr, tr = G))   # winsorized SD
# Standard error of the trimmed mean
se <- s / ((1 - 2 * G) * sqrt(N))
# Degrees of freedom
h <- N - 2 * floor(G * N)   # effective sample size after trimming
df <- h - 1
# Test stat
mu <- 6
t_stat <- (m - mu) / se
# Two-sided p-value
p <- pt(abs(t_stat), df = df, lower.tail = FALSE) * 2
# 95% CI
alpha <- 0.05
t_crit <- abs(qt(alpha / 2, df = df))
low_ci <- m - t_crit * se
top_ci <- m + t_crit * se
ci <- paste0("[", round(low_ci, 3), ", ", round(top_ci, 3), "]")
## T = 4.3186
## df = 350
## p = 2.04713e-05
## 95% CI = [6.066, 6.175]
Since \(p < 0.05\) (equivalently, the 95% CI does not contain 6), we reject the media’s claim.
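As a sanity check on simulated data (parameters below are made up), note that with tr = 0 the trimmed-mean formulas above reduce to the ordinary one-sample t-test:

```r
# With no trimming (tr = 0), the trimmed-t machinery reduces to t.test():
# winvar(x, tr = 0) equals var(x), and h = N, so df = N - 1.
set.seed(2)
x <- rnorm(51, mean = 6.1, sd = 0.8)  # hypothetical screen-time sample
G <- 0                                # no trimming
N <- length(x)
m <- mean(x, tr = G)
s <- sd(x)                            # same as sqrt(winvar(x, tr = 0))
se <- s / ((1 - 2 * G) * sqrt(N))
df <- (N - 2 * floor(G * N)) - 1      # reduces to N - 1
t_manual <- (m - 6) / se
ref <- t.test(x, mu = 6)
c(manual = t_manual, builtin = unname(ref$statistic))
```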
Considering the values obtained in the previous question’s analysis, do you believe the media’s claim is unreasonable? Why? (No calculation is required for this question)
The estimated population mean screen time is \(6.12\) hours, while the media claims it to be 6 hours. This represents a difference of approximately 7.2 minutes. Although this difference is statistically significant, it is unlikely to be practically meaningful for most people (i.e., the effect is small). Therefore, the media’s claim is not unreasonable.
Suppose you record the number of times a neuron fires per second in response to a stimulus. What scale of measurement does this represent?
What scale of measurement would differences between rates of neuron firing be?
You measure the amount of dopamine (in nanograms per milliliter) present in a rat’s nucleus accumbens after exposure to a drug. What scale of measurement is this?
You group EEG data based on whether the participant was in a resting, task, or sleep condition. What scale of measurement is this?
A researcher counts how many milliseconds it takes participants to identify an image of a fearful face. What scale of measurement is used?
Subjects rate their pain during a mild electric shock using a 5-point scale from ‘no pain’ to ‘extreme pain’. What kind of measurement scale is this?
A neuroscientist records brain temperature (in Celsius) before and after a task. What scale of measurement does this involve?
Participants are asked to rank a series of images based on how emotionally disturbing they find them, from least to most disturbing. What scale of measurement is this?
In a memory experiment, participants are categorized based on the brain region that showed the most activation (e.g., hippocampus, amygdala, prefrontal cortex). What scale of measurement does this reflect?